Analyzing Retweet Dynamics of German MPs in the 19th Legislative Period

Assignment 2: Data Analysis Exercise

Author

Lukas Warode

In a data analysis exercise a real-world, but pre-defined and cleaned data set, is assigned from a small collection of examples. A list of descriptive measures, as well as community detection methods should be applied, using a programming language and libraries of choice (which have been introduced in the course before). In a short report, either including the code (e.g., a python notebook) or a text document with plots (in this case the code should be submitted additionally), the results should be presented and their meaning interpreted and discussed with respect to the dataset at hand. Students will have two weeks to solve this task individually at home. Grading will be one half the correct application of the descriptive measures and the other half the presentation (i.e., plots), the interpretation and the discussion of those results.

library(tidyverse)
library(igraph)
library(tidygraph)
library(ggraph)

Load and Manipulate Dataset

In my Master’s Thesis, I scale German MPs (MdBs) of the 19th legislative period across ideological dimensions that are determined in an unsupervised way by using parliamentary speeches and Tweets. While retweet dynamics do not play a big role in the scaling algorithm per se, party affiliation does and is particularly interesting for network analytical approaches, especially in the realm of clustering.

tweet_df_final <- readRDS("data/tweet_df_final_2.rds")

# further meta information 
od_tw_info <- read_csv("data/od_tw_merge_final.csv")

How many Tweets are there in total? Over 1.3 Million - quite a lot.

tweet_df_final %>% 
  nrow()
[1] 1301427

How many Twitter users are there in total across all MPs of the 19th legislative period?

tweet_df_final %>% 
  select(author_id, tw_id) %>% 
  distinct() %>% 
  nrow()
[1] 515

Create data.frame that exhibits information of referenced Tweets

retweet_df <- tweet_df_final %>% 
  select(referenced_tweets, tweet_type_character, tweet_type_final, id) %>% 
  mutate(
    # type of referenced tweet
    ref_tweet_types = map(1:length(referenced_tweets),
                          ~ referenced_tweets[[.x]][["type"]]),
    # id of referenced tweet (NOT id of user)
    ref_tweet_id = map(1:length(referenced_tweets),
                       ~ referenced_tweets[[.x]][["id"]])
  )  

Get author information of Tweets that were referenced

author_id_ref_tweet <- retweet_df %>% 
  filter(ref_tweet_types == "retweeted") %>% 
  mutate(ref_tweet_id = unlist(ref_tweet_id)) %>% 
  left_join(
    tweet_df_final %>% select(id, author_id, tw_id), 
    by = c("ref_tweet_id" = "id")
  ) %>% 
  filter(!is.na(author_id), !is.na(tw_id)) %>% 
  select(id, ref_tweet_id, author_id, author_id, tw_id)

Create final data.frame that only consists of retweets that occured within the dataset

Filter for retweets and merge author information from retweeted Tweets.

# vector with IDs from Tweets that are from authors within the dataset
tweet_ids <- tweet_df_final$id

retweet_within_df <- retweet_df %>% 
  filter(ref_tweet_types == "retweeted") %>% 
  filter(ref_tweet_id %in% tweet_ids) %>% 
  left_join(tweet_df_final %>% select(id, author_id, tw_id)) %>% 
  # directedness: from (orginal tweet) -> to (retweeted tweet) 
  left_join(author_id_ref_tweet, by = "id", suffix = c("_to", "_from"))

How many Tweets are left after constructing the “within-retweet data.frame”? Just about 100000 or 7.7% of all Tweets.

retweet_within_df %>% nrow()
[1] 100311
paste0((nrow(retweet_within_df) / nrow(tweet_df_final) * 100) %>% round(1), "%")
[1] "7.7%"

Validate Retweet-Network Structure

retweet_within_df %>% select(tw_id_from, tw_id_to) %>% head(5)
      tw_id_from       tw_id_to
1   katjakipping dorisachelwilm
2    victorperli dorisachelwilm
3  niemamovassat dorisachelwilm
4 michel_brandt_ dorisachelwilm
5 dietmarbartsch dorisachelwilm

This makes actually sense! Doris Achelwilm (MP from Left Party) retweets Katja Kipping and Dietmar Bartsch, which do both play a central role in the Left Party: They were party leaders in the 19th legislative period.

Plot most retweeted MPs

First, let us create a vector with party colors.

party_colors <- c("#138BD8", "#000000", "#FFEE0A", "#529222", "#AE1862", "#E30019")
names(party_colors) <- c("AfD", "CDU/CSU", "FDP", "Grüne", "DIE LINKE.", "SPD")

We also create a ggplot2 theme, which is based on theme_bw(), but has some refinements.

theme_custom <- function() {
  font <- "Corbel"
  # base theme
  theme_bw() %+replace%
    theme(
      text = element_text(family = font),
      legend.text = element_text(size = 9),
      legend.title = element_blank(),
      legend.box.background = element_rect(
        colour = "black",
        fill = "white",
        linetype = "solid"
      ),
      # grid lines
      panel.grid.major = element_line(color = "grey60", size = 0.2),
      panel.grid.minor = element_line(color = "grey80", size = 0.1),
      # faceting
      strip.background = element_blank(),
      strip.text = element_text(color = "black"),
    )
}

Now, we set the new theme as default.

theme_set(theme_custom())

We also create a data.frame that just includes the party affiliation per MP.

party_user <- tweet_df_final %>% 
  distinct(tw_id, party)

Plot Top 20

Now, we count how often per MP a Tweet was retweeted by another MP. We plot the to 20 MPs in total and visualize the party affiliation.

retweet_within_df %>% 
  group_by(tw_id_from) %>%
  count() %>%
  ungroup() %>% 
  arrange(desc(n)) %>% 
  top_n(20) %>% 
  left_join(party_user, by = c("tw_id_from" = "tw_id")) %>%
  ggplot(aes(x = n, y = reorder(tw_id_from, n), fill = party)) + 
  geom_col() +
  scale_fill_manual(values = party_colors) +
  labs(y = "", title = "Most Prominent MPs by Retweets of other MPs") +
  theme_custom()

Interestingly, the AfD and FDP dominate the chart, while there is no single SPD MP within the top 20 most retweeted MPs within. Let’s check the pattern and plot by party! Overall, the results seem to make sense, given that party leaders and influential MPs are the most retweeted MPs according to the bar plot.

Plot Top 20 by Party

Now, let’s turn the attention to the most retweeted MPs by party. Both CDU/CSU and SPD, the biggest parties and also only members of the governing coalition of the 19th legislative period do not have as many influential MPs, measured by number of retweets across their colleagues, as opposition parties have.

retweet_within_df %>% 
  left_join(party_user, by = c("tw_id_from" = "tw_id")) %>%
  filter(!party %in% c("Fraktionslos", NA)) %>% 
  group_by(tw_id_from, party) %>%
  count() %>%
  group_by(party) %>% 
  arrange(desc(n)) %>%
  top_n(10) %>% 
  ggplot(aes(x = n, y = reorder(tw_id_from, n), fill = party)) + 
  geom_col() +
  facet_wrap(~ party, scales = "free_y", ncol = 2) +
  scale_fill_manual(values = party_colors) +
  labs(y = "", title = "Most Prominent MPs by Retweets of other MPs", subtitle = "Grouped by Party") +
  theme_custom()

Construct Retweet-Network

In a Retweet-Network each retweet is represented as a directed link in a network that connects from the original Tweet to the referenced retweet. Users that are often retweeted within the network are expected to be associated with some degree of (digital) political relevance, even though there a lot of potential confounding variables (e.g. number of Tweets, number of Followers, general social media behaviour and association with socio-demographic factors such as cohort affiliation).

retweet_network <- retweet_within_df %>% 
  select(from = tw_id_from, to = tw_id_to) %>% 
  as_tbl_graph(directed = TRUE) %>% 
  left_join(party_user, by = c("name" = "tw_id")) 
 
# get corresponding row ID (from ID) 
row_id_from <- retweet_network %>% 
  as_tibble() %>%
  mutate(id_from = row_number())

# update retweet network with edge information: name and party of retweeted users (from)
retweet_network <- retweet_network %>% 
  activate(edges) %>% 
  left_join(row_id_from, by = c("from" = "id_from")) %>% 
  rename(party_from = party, name_from = name) %>% 
  activate(nodes)

retweet_network
# A tbl_graph: 487 nodes and 100311 edges
#
# A directed multigraph with 1 component
#
# Node Data: 487 × 2 (active)
  name           party     
  <chr>          <chr>     
1 katjakipping   DIE LINKE.
2 victorperli    DIE LINKE.
3 niemamovassat  DIE LINKE.
4 michel_brandt_ DIE LINKE.
5 dietmarbartsch DIE LINKE.
6 susanneferschl DIE LINKE.
# … with 481 more rows
#
# Edge Data: 100,311 × 4
   from    to name_from     party_from
  <int> <int> <chr>         <chr>     
1     1    44 katjakipping  DIE LINKE.
2     2    44 victorperli   DIE LINKE.
3     3    44 niemamovassat DIE LINKE.
# … with 100,308 more rows

Degree Centrality

Out-degree

In the previous plots, we already calculated the out-degree centrality (which is the same as the number of outgoing edges per node or the number of original Tweets that have been retweeted within). Nevertheless, double-checking is always a good practice!

# %N>% activate nodes (tidygraph syntax)
retweet_network %N>%
  mutate(
    # in_degree = centrality_degree(mode = "in"),
    out_degree = centrality_degree(mode = "out")
  ) %>% 
  arrange(desc(out_degree)) %>% 
  select(name, out_degree, party) %>% 
  as_tibble() %>% 
  head(20) %>% 
  knitr::kable()
name out_degree party
c_lindner 4066 FDP
udohemmelgarn 3983 AfD
alice_weidel 3853 AfD
frank_pasemann 3794 AfD
brihasselmann 2475 Grüne
marcobuschmann 2294 FDP
schneider_afd 1830 AfD
beatrix_vstorch 1554 AfD
gtzfrmming 1330 AfD
dietmarbartsch 1208 DIE LINKE.
m_reichardt_afd 1164 AfD
konstantinkuhle 1162 FDP
olliluksic 1154 FDP
paulziemiak 1022 CDU/CSU
joanacotar 1021 AfD
konstantinnotz 956 Grüne
renner_afd 933 AfD
nicole_hoechst 921 AfD
jensspahn 901 CDU/CSU
johannesvogel 888 FDP

Now, let’s plot the distribution of the out-degree. There are a few outliers at the upper end.

retweet_network %N>%
  mutate(
    # in_degree = centrality_degree(mode = "in"),
    out_degree = centrality_degree(mode = "out")
  ) %>% 
  as_tibble() %>% 
  ggplot(aes(x = out_degree)) + 
  geom_histogram() +
  labs(x = "Out-Degree") +
  theme_custom()

The out-degree distribution is quite comparable across parties: Most MPs were never retweeted, but have important retweet hubs. However, for the Green and Left Party, the distribution seems to be a bit less skewed.

retweet_network %N>%
  filter(!party %in% c("Fraktionslos", NA)) %>% 
  mutate(
    # in_degree = centrality_degree(mode = "in"),
    out_degree = centrality_degree(mode = "out")
  ) %>% 
  as_tibble() %>% 
  ggplot(aes(x = out_degree, fill = party)) + 
  geom_histogram() +
  labs(x = "Out-Degree") +
  facet_wrap(~ party, scales = "free") +
  scale_fill_manual(values = party_colors) +
  theme_custom()

In-degree

The in-degree corresponds to the “inverse retweet behaviour”: It indicates which MPs retweeted Tweets of other MPs.

# %N>% activate nodes (tidygraph syntax)
retweet_network %N>%
  mutate(
    in_degree = centrality_degree(mode = "in"),
    # out_degree = centrality_degree(mode = "out")
  ) %>% 
  arrange(desc(in_degree)) %>% 
  select(name, in_degree, party) %>% 
  as_tibble() %>% 
  head(20) %>% 
  knitr::kable()
name in_degree party
udohemmelgarn 6002 AfD
schneider_afd 5001 AfD
frank_pasemann 3583 AfD
martinrosemann 3409 SPD
renner_afd 3270 AfD
olliluksic 2140 FDP
nicole_hoechst 1735 AfD
drandreasnick 1567 CDU/CSU
stbrandner 1519 AfD
rkiesewetter 1505 CDU/CSU
reinholdmdb 1494 FDP
c_jung77 1284 FDP
k_sa 1225 Grüne
elsnervongronow 1187 AfD
tschipanski 1006 CDU/CSU
ullinissen 991 SPD
andi_wagner 969 DIE LINKE.
gtzfrmming 868 AfD
djanecek 834 Grüne
steinekecdu 807 CDU/CSU
retweet_network %N>%
  mutate(
    in_degree = centrality_degree(mode = "in"),
    # out_degree = centrality_degree(mode = "out")
  ) %>% 
  as_tibble() %>% 
  ggplot(aes(x = in_degree)) + 
  geom_histogram() +
  labs(x = "In-Degree") +
  theme_custom()

retweet_network %N>%
  filter(!party %in% c("Fraktionslos", NA)) %>% 
  mutate(
    in_degree = centrality_degree(mode = "in"),
    # out_degree = centrality_degree(mode = "out")
  ) %>% 
  as_tibble() %>% 
  ggplot(aes(x = in_degree, fill = party)) + 
  geom_histogram() +
  labs(x = "In-Degree") +
  facet_wrap(~ party, scales = "free") +
  scale_fill_manual(values = party_colors) +
  theme_custom()

The in-degree distribution seems to be a bit more normally distributed, meaning that “inverse hubs”, or MPs that heavily retweeted other MPs, are also present but not as dominant as retweeted MP hubs. However, the AfD seems to dominate the retweet dynamic from this perspective, as the table above shows.

Comparing Out- and In-degree

In- and out-degree seem to be correlated (\(R = 0.53\)), however, there are outliers that influence the relationship heavily.

# %N>% activate nodes (tidygraph syntax)
retweet_network %N>%
  mutate(
    in_degree = centrality_degree(mode = "in"),
    out_degree = centrality_degree(mode = "out")
  ) %>% 
  select(name, in_degree, out_degree) %>% 
  as_tibble() %>% 
  left_join(party_user, by = c("name" = "tw_id")) %>%
  ggplot(aes(out_degree, in_degree)) + 
  geom_point(alpha = 0.5) +
  ggpubr::stat_cor() +
  theme_custom()

It seems that there are a few outliers at the upper end of the distribution. Let’s see whether the correlation holds when they are removed.

The correlation is still quite high (\(R = 0.38\)) and highly significant, when we remove the MPs with the top 20 highest in- and out-degrees.

top_20_in <- retweet_network %>% 
  mutate(in_degree = centrality_degree(mode = "in")) %>% 
  arrange(desc(in_degree)) %>% 
  as_tibble() %>% 
  top_n(20) %>% 
  pull(name)

top_20_out <- retweet_network %>% 
  mutate(out_degree = centrality_degree(mode = "out")) %>% 
  arrange(desc(out_degree)) %>% 
  as_tibble() %>% 
  top_n(20) %>% 
  pull(name)

# %N>% activate nodes (tidygraph syntax)
retweet_network %N>%
  mutate(
    in_degree = centrality_degree(mode = "in"),
    out_degree = centrality_degree(mode = "out")
  ) %>% 
  select(name, in_degree, out_degree) %>% 
  as_tibble() %>% 
  left_join(party_user, by = c("name" = "tw_id")) %>%
  filter(!name %in% top_20_in, !name %in% top_20_out) %>% 
  ggplot(aes(out_degree, in_degree)) + 
  geom_point(alpha = 0.5) +
  ggpubr::stat_cor() +
  theme_custom()

retweet_network %N>%
  filter(!party %in% c("Fraktionslos", NA)) %>% 
  mutate(
    in_degree = centrality_degree(mode = "in"),
    out_degree = centrality_degree(mode = "out")
  ) %>% 
  select(name, in_degree, out_degree) %>% 
  as_tibble() %>% 
  left_join(party_user, by = c("name" = "tw_id")) %>%
  filter(!name %in% top_20_in, !name %in% top_20_out) %>% 
  ggplot(aes(out_degree, in_degree)) + 
  geom_point(alpha = 0.5, aes(color = party)) +
  ggpubr::stat_cor(show.legend = F) +
  facet_wrap(~ party) +
  scale_color_manual(values = party_colors) +
  theme_custom()

Distance / Shortest Path Length

Most of the users are connected across (a path length of) 2 or 3 retweets.

distances_retweet_network <- retweet_network %>% 
  distances() %>%
  table() %>% 
  as_tibble() %>% 
  rename(Distance = 1) 

distances_retweet_network %>% 
  knitr::kable()
Distance n
0 487
1 18554
2 105456
3 101042
4 11290
5 332
6 8
distances_retweet_network %>%
  ggplot(aes(x = Distance, y = n)) + geom_col() +
  ggtitle("Distribution of Distance (Shortest Path)") +
  theme_custom()

Network Visualization

Create a new custom theme, based on ggraph::theme_graph().

theme_graph_custom <- function() {
  font <- "Corbel"
  # base theme
  theme_graph() %+replace%
    theme(
      text = element_text(family = font),
      legend.text = element_text(size = 9),
      legend.title = element_blank(),
      legend.box.background = element_rect(
        colour = "black",
        fill = "white",
        linetype = "solid"
      ),
      # faceting
      strip.background = element_blank(),
      strip.text = element_text(color = "black"),
    )
}

Standard Plot

Using ggraph auto/default options to plot all nodes and edges, we can see that there is a strong pattern of “intra-party retweeting”. Interestingly, while most parties members exhibits almost exclusively retweet within the party, there is a strong visual intersection of SPD and Left Party retweets. To a lesser extent, this also repeats for party of the Greens with the SPD

plot_all_auto <- retweet_network %>% 
  ggraph(layout = "auto") + 
  geom_edge_link(alpha = 0.015) +
  geom_node_point(aes(color = party))  +
  scale_color_manual(values = party_colors) + 
  theme_graph_custom() 

plot_all_auto

Circular

Now let’s visualize the network in a circular way, highlighting the retweet dynamics in a way that underlines the party structure of the dataset.

plot_all_circular <- retweet_network %>% 
  filter(!party %in% c("Fraktionslos", NA)) %>% 
  ggraph(layout = "linear", circular = TRUE) + 
  geom_edge_arc(aes(color = party_from), alpha = 0.015) +
  scale_edge_color_manual(values = party_colors) + 
  coord_fixed() +
  theme_graph_custom() +
  theme(legend.position = "none")

plot_all_circular

Further Centrality Measures

Closeness Centrality

The Closeness Centrality is defined as the inverse total distance of paths between a node and all other nodes in the network. Shorter paths to others are thus an indicator of node centrality.

closeness_retweet_network <- retweet_network %>% 
  mutate(Closeness = centrality_closeness())

The Closeness Centrality is not dominated by the AfD, which was exhibiting the MPs with the highest degrees. This finding underlines that centrality measures that go beyond the degree centrality play an important role when assessing the actual dynamics of the dataset.

closeness_retweet_network %>% 
  filter(Closeness != 1) %>% 
  arrange(desc(Closeness)) %>% 
  select(name, Closeness, party) %>% 
  as_tibble() %>% 
  head(20) %>% 
  knitr::kable()
name Closeness party
konstantinkuhle 0.0011416 FDP
karl_lauterbach 0.0010776 SPD
brihasselmann 0.0010672 Grüne
marcobuschmann 0.0010616 FDP
heikomaas 0.0010604 SPD
larsklingbeil 0.0010504 SPD
johannesvogel 0.0010194 FDP
rkiesewetter 0.0010194 CDU/CSU
renatekuenast 0.0010183 Grüne
thomasoppermann 0.0010183 SPD
mgrossebroemer 0.0010163 CDU/CSU
c_lindner 0.0010152 FDP
konstantinnotz 0.0010152 Grüne
schneidercar 0.0010121 SPD
katarinabarley 0.0010121 SPD
nouripour 0.0010101 Grüne
mastrackzi 0.0010091 FDP
sigmargabriel 0.0010091 SPD
lambsdorff 0.0010070 FDP
cem_oezdemir 0.0010040 Grüne

The closeness centrality seems to be actually normally distributed, while there are some outliers on the lower bound.

closeness_retweet_network %>% 
  filter(Closeness != 1) %>% 
  as_tibble() %>% 
  ggplot(aes(x = Closeness)) + 
  geom_density() +
  theme_custom()

While most parties have an approximately equally distributed Closeness Centrality, the AfD has the lowest one with a visually perceivable difference. For interpretational reasons, this is likelly due to internal heterogeneity, given the populist character and short existence period of the party.

closeness_retweet_network %>% 
  filter(!party %in% c("Fraktionslos", NA)) %>%
  filter(Closeness != 1) %>% 
  as_tibble() %>% 
  ggplot(aes(x = Closeness, y = reorder(party, Closeness), fill = party)) + 
  ggridges::stat_density_ridges(alpha = 0.5, quantile_lines = T, quantiles = 4) +
  scale_fill_manual(values = party_colors) +
  theme_custom() +
  labs(y = "")

Betweenness Centrality

TODO: explain betweenness centrality

betweenness_retweet_network <- retweet_network %>% 
  mutate(Betweenness = centrality_betweenness())
betweenness_retweet_network %>% 
  arrange(desc(Betweenness)) %>% 
  select(name, Betweenness, party) %>% 
  as_tibble() %>% 
  head(20) %>% 
  knitr::kable()
name Betweenness party
tschipanski 26989.810 CDU/CSU
s_muenzenmaier 19464.496 AfD
stbrandner 18131.114 AfD
olliluksic 14463.042 FDP
rkiesewetter 12089.569 CDU/CSU
_martinneumann 8452.709 FDP
fabiodemasi 8202.163 DIE LINKE.
joanacotar 8010.589 AfD
hubertus_heil 7998.812 SPD
c_lindner 7090.924 FDP
gtzfrmming 6647.937 AfD
timon_gremmels 6566.856 SPD
wanderwitz 6504.270 CDU/CSU
matthiashauer 6361.569 CDU/CSU
udohemmelgarn 6303.254 AfD
mgrossebroemer 6059.420 CDU/CSU
sven_kindler 5791.922 Grüne
larsklingbeil 5217.861 SPD
drandreasnick 4975.422 CDU/CSU
konstantinnotz 4778.380 Grüne
betweenness_retweet_network %>% 
  as_tibble() %>% 
  ggplot(aes(x = Betweenness)) + 
  geom_histogram() +
  theme_custom()

betweenness_retweet_network %>% 
  as_tibble() %>% 
  ggplot(aes(x = Betweenness)) + 
  geom_histogram() +
  scale_x_log10() +
  theme_custom()

betweenness_retweet_network %>% 
  filter(!party %in% c("Fraktionslos", NA)) %>%
  filter(Betweenness != 1) %>% 
  as_tibble() %>% 
  ggplot(aes(x = Betweenness, color = party)) + 
  facet_wrap(~ party) +
  geom_density() +
  scale_x_log10() +
  # geom_density() +<
  scale_color_manual(values = party_colors) +
  theme_custom() +
  labs(y = "")

Selecting Subgraph for Visualization

Selecting a subgraph (sub-network) based on the largest component.

Community Detection

Since we are dealing with a directed network, not every type of community detection algorithm is appropriate.

retweet_network_group_betweenness <- retweet_network %>% 
  mutate(group = group_edge_betweenness(n_groups = 6))

Community Detection in Relationship with Party Affiliation

Further Work

Relationship of Centrality and Social Media Profile Influence

The echo: false option disables the printing of code (only output is displayed).